Deeds, or charters, dealing with property rights, provide a continuous
documentation which can be used by historians to study the
evolution of social, economic and political changes. This study is con- 
cerned with charters (written in Latin) dating from the tenth through 
early fourteenth centuries in England. Of these, at least one million 
were left undated, largely due to administrative changes introduced 
by William the Conqueror in 1066. Correctly dating such charters 
is of vital importance in the study of English medieval history. This 
paper is concerned with computer-automated statistical methods for 
dating such document collections, with the goal of reducing the con- 
siderable efforts required to date them manually and of improving 
the accuracy of assigned dates. Proposed methods are based on such 
data as the variation over time of word and phrase usage, and on 
measures of distance between documents. The extensive (and dated) 
Documents of Early England Data Set (DEEDS) maintained at the 
University of Toronto was used for this purpose. 

1. Introduction. Our object in this paper is to contribute toward the de- 
velopment of statistical procedures for computerized calendaring (i.e., dat- 
ing) of text-based documents arising, for example, in collections of historical 
or other materials. The particular data set which motivated this study is the 
Documents of Early England Data Set (DEEDS) maintained at the Centre 
for Medieval Studies of the University of Toronto. This data set consists 
of charters, that is, documents evidencing the transfer and/or possession of 
land and/or movable property, and the rights which govern them. The doc- 
uments in question date from the tenth through early fourteenth centuries 
and are written in Latin, the administrative language of their time. They 
were mostly obtained from cartularies and charter collections produced in
England and Wales, with a few from Scotland. 

A peculiarity of that era is that most of the charters that were issued do 
not bear a date or other chronological marker. This is particularly so from 
the time of the Conquest in 1066, until about 1307, when fewer than 10% of 
the more than one million surviving charters bore dates. (A more complete 
background to these circumstances is provided in Section 2.) Charters dat- 
ing from the twelfth and thirteenth centuries, however, are a vital source for 
the study of English social, economic and political history, and significant 
historical information can be derived when such charters can be dated or 
sequenced accurately. (For some examples, see Section 2.) The charters com- 
prising the DEEDS data set are derived from among those charters which 
can in fact be accurately dated, and, specifically, to within a year of their 
actual issue. A key aim of the DEEDS project was to produce a reliable data 
base from which methods for dating the undated charters could be devised. 

The DEEDS data set currently consists of some 10,000 documents, in 
computer readable form, taken from published editions of charter sources. 
These have all been dated by historians on the basis of internal dates or 
other internal chronological markers such as person or place names, or refer- 
ence to a datable event. (Note, however, that dating manually, for instance, 
by comparing names, is prone to errors which can multiply when charters 
are used to date other charters; not infrequent names such as "William son 
of Richard son of William son of Richard" can easily be generationally mis- 
aligned.) One key idea underlying our work is that changes in language use 
across time can be used to help identify the date of an undated document. 
For example, a study of dated charters shows that the phrase "amicorum
meorum vivorum et mortuorum" ("of my friends living and dead") was in 
currency between the years 1150 and 1240. As another example, the phrase 
"Francis et Anglicis" (a form of address: "to French and English") was 
phased out when Normandy was lost by England to the French in 1204. 
By combining evidence from many words and phrases, and/or by examining 
measures of distance between documents, our goal is to develop algorithms 
to help automate the process of estimating the dates of undated charters 
through purely computational means. 

In Section 2 we provide further historical background concerning the char- 
ters with which the DEEDS data set is concerned. We explain there how 
it happened that so many charters had been left undated, and indicate the 
importance that dating charters correctly has for research into the social, 
economic and political history of England in the high middle ages. Following 
this, we provide a more detailed description of that part of the DEEDS data 
set on which our work was based. 

In Section 3 we first briefly discuss some concepts relevant for statistical 
processing of text-based documents, and set down the notation to which we 
will adhere throughout. We then review some previous calendaring work that
had been carried out using the DEEDS data set. In Sections 4, 5 and 6 we 
discuss three distinct methods for calendaring undated charters. The meth- 
ods described in Section 4 are based on nearest neighbors (kNN); essentially, 
these methods average the dates of documents in a training set which have 
known dates, and which are "closest" to the one being dated. This approach 
requires notions of distance between documents which we also discuss there, 
as well as the selection of tuning parameters using cross-validation. The 
method proposed in Section 5 is based on an analogue of maximum likeli- 
hood which we refer to as the method of maximum prevalence (MP). This 
method attempts to assign a probability, at every point in time, that the 
document would have randomly been produced then, and it estimates the 
date of the document to be the time at which this probability is greatest. 
Finally, in Section 6, we propose a method based on determining the mini- 
mum of a nonparametric quantile regression curve fitted to a scatterplot of 
the distances from a document to be dated to the documents in a test set, 
against the known dates of those test documents. Some asymptotic theory 
for the estimation methods is discussed briefly in Section 7, and based on 
the three statistical methods discussed, numerical work we carried out using 
the DEEDS data set is described in Section 8. Some concluding remarks are 
provided in Section 9 where avenues for further work are also indicated. 

The method discussed in Section 3 is due to R. Fiallos, but is discussed
here in statistical terminology and in greater detail than in Fiallos (2000). 
The methods reviewed in Section 4 are from Feuerverger et al. (2005, 2008) 
and are included here for comparison and completeness. The maximum 
prevalence method described in Section 5 is our main new methodological 
contribution. As well, a key contribution of our work lies in the novel appli- 
cation of the mentioned estimation methods to historical data of the type 
considered here. This work may be seen in the context of other work in the 
digital humanities, temporal language modeling and information retrieval. 
Some entry points to that literature in the context of calendaring documents 
include de Jong, Rode and Hiemstra (2005), Kanhabua and Norvag (2008, 
2009) and the references therein. For broader context see, for example, Berry
and Browne (2005) and Manning, Raghavan and Schütze (2008).

2. Description of the data set. The keeping of records pertaining to the 
ownership and transfer of property is as old as writing itself, and dates back 
to at least the third millennium BC in Sumeria where such documents were
inscribed on clay. Consequently, deeds, or charters (as they are known), 
provide a continuous legal documentation which can be used by historians 
to study the evolution of social, economic and administrative changes. For 
charters to be used in this way, however, establishing an accurate chronol- 
ogy is important. Below, we will use the term charter to represent an official 
legal document, often written or issued by a religious, lay or royal
institution, which typically provides evidence of the transfer of landed or movable
property and the rights which govern them. 

It was the fate of England, between the time of the Conquest in 1066 
when William the Conqueror (also Duke of Normandy) ascended the English 
throne, until the start of the reign of Edward II in 1307, that (in contrast to
the Roman and papal traditions) most charters issued did not bear a date
regardless of the level of society in which the charters originated. William 
I introduced into the royal chancery the then-current Norman custom of 
issuing charters without dates or other chronological markers. This custom 
continued until the reign of England's sixth post-Conquest (and crusader) 
king — Richard the Lionhearted (1189-1199) — when, for the first time, doc- 
uments issued from the royal chancery began regularly to include a date. It 
was, however, not until the accession of the tenth king, Edward II (1307- 
1327), that the custom of including dates also became universally adopted 
by those responsible (ecclesiastics and laymen) for issuing private charters. 

Charters from the twelfth and thirteenth centuries, written in Latin — 
the administrative language of the time — are the predominant source for 
the study of English social, economic and political history of that era. It is 
estimated that at least one million charters have survived from that nearly 
250 year period, some as originals, but most as copies in cartularies (i.e., 
deed books). Of these, well over 90 percent do not bear dates, so that fewer 
than 10% of them can be dated at all accurately. Although increasingly less 
so with the passage of time, even at the turn of the fourteenth century the 
percentage of English charters bearing dates remains modest. 

Significant historical information can be derived when charters can be 
dated or sequenced correctly as the following three examples attest: (i) 
A study of donations to the twelfth-century Order of the Hospital of Saint 
John of Jerusalem allowed historians to conclude that the Order became 
militarized in response to the fall of Edessa in 1144, and to the call for the 
Second Crusade in 1145. (ii) Widespread reluctance to incorporate the in- 
vocation of divine intervention into legal language of the day evidences the 
social unrest in England under the Papal Interdict of 1208-1214. (iii) With 
the Crusades came the foundation of the military-religious orders known 
as the Templars and the Hospitallers who financed their activities in part 
through the management of properties in Europe and the Middle East. The 
relative growth of their estates in London and its suburbs from the twelfth 
to the fourteenth centuries confirms without a doubt that as London spread 
outside its ancient Roman walls in the twelfth century, the Templars played a 
far more significant role in suburban development, and from a much earlier 
period, than did the Hospitallers. Further background and examples may 
be found in Gervers (2000), Gervers and Hamonic (2010), and references 
therein. 

The DEEDS database, maintained at the University of Toronto, is now 
a corpus of over 10,000 medieval Latin charters dealing primarily with land 
and movable objects (grants, leases, agreements, etc.) and rights regulating
their use. The charters in this corpus are all dated ones; they were either 
dated internally or they contained sufficient information to enable histori- 
ans to situate them to within a year of their issue. These charters were all 
obtained from published editions of charter sources covering England and 
Wales, and a few from Scotland, and were derived predominantly from the 
archives of religious houses and towns, as well as lay institutions such as 
colleges and universities. (Note that because the charters were taken from 
published sources, they necessarily bear any editorial decisions made by the 
publishing author.) The DEEDS project has, as a key objective, to establish 
computerized methodologies for dating the vast number of medieval charters 
that have not yet been dated in the hope that, taken together, the dated 
documents from the database, and those to which dates can be attributed 
via statistical and other means, may allow historians to construct a more 
precise understanding of the evolution of English society within that era. 
We remark that due to the paucity of surviving documents, and the rarity 
among them of charters bearing dates, there is very little in the DEEDS 
database from before 1160. 

Original charters, written on parchment, and bearing the seal of the issuer 
or his patron, are rare. Most of the charters that have survived today exist as 
copies in deed books known as cartularies which were produced periodically 
during the eleventh to fifteenth centuries. (Such copying could occasionally 
introduce transcriptional or other changes and inaccuracies.) Consequently, 
palaeography and sigillography generally cannot help in the calendaring pro- 
cess, leaving the evolution over time of vocabulary usage, word patterns and 
document structure as the primary data from which dating can be carried 
out. These charters are preserved today in such repositories as the National 
Archives, the British Library, the archives of Oxford and Cambridge Uni- 
versities, in county record offices and in private collections. 

The data: Although the DEEDS data set has grown, 3353 documents 
were available to us when our computations were implemented; we now de- 
scribe this data set. Prior to their analyses, certain preprocessing steps were 
applied. Dates were mapped into the Julian calendar. Documents were nor- 
malized for variations in spelling, and all punctuation marks were removed. 
Names were left unchanged, and just as they appeared in the document. 
All numbers appearing in a document were encased between exclamation
marks (thus, xv became !xv!), and all numbers were subsequently treated as
being the same distinct word. (We are not referring here to actual dates 
which might appear in a document allowing it to be dated without diffi- 
culty.) The determination of distinct words was taken to be case sensitive; 
this rule was applied even to the first words of sentences whose first charac- 
ter was generally in upper case. A sample of a document processed in this 
way is provided at the end of this section. 

Fig. 1. Histogram for the distribution of dates of the 3353 dated documents.

Figures 1 and 2, as well as Table 1, provide some graphical and tabular
information about our 3353 dated DEEDS documents. Figure 1 is a histogram
of the known dates for the documents; the earliest of these is dated
1089, and the latest is dated 1438. The mean date of these charters is 1237
with a standard deviation of 46 years. Figure 2 is a histogram of the lengths
(i.e., word counts) of the documents; the shortest of these consisted of only
15 words, and the longest of 2054 words; the median and mean of the word
counts were 202 and 237, respectively, while the lower and upper quartiles
were 151 and 275 words. Very short or very long documents are rare. Words
consisted of an average of 6.5 characters. No dependencies worthy of note
were detected between the lengths of the documents and their dates, their
contents or any other features.

Fig. 2. Histogram for the distribution of lengths (word counts) of the 3353 documents.

Table 1
Frequency of word repetitions in the data set of 3353 documents,
comprising 50,006 distinct words

Word frequency                                 Number of distinct words
Words occurring only once                      28,282
Words occurring exactly twice                  7223
Words occurring exactly three times            3265
Words occurring more than three times          11,236
Words occurring more than 10 times             4952
Words occurring more than 30 times             2330
Words occurring more than 100 times            1004
Words occurring more than 300 times            415
Words occurring more than 1000 times           109

Among the 3353 documents, a total of 50,006 distinct words occurred. Of 
these, 28,282 words (56%) occurred only once. Words which occurred only 
once were not considered relevant for our study because such words could 
not simultaneously occur in both a test subset and a validation subset of 
the data. The frequency of repetition for repeated words is given in Table 1. 
While it is possible that in a few instances such repetitions all occurred 
within the same document, we did not keep track of such occurrences. 
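
As an illustration of how a tabulation like Table 1 can be produced, the following is a minimal sketch and not the code used for the DEEDS computations. It assumes that each preprocessed document is a plain string of whitespace-separated words; the function name `word_frequency_table` and the two toy documents are hypothetical. Counting is case sensitive, as in the preprocessing described above.

```python
from collections import Counter


def word_frequency_table(documents):
    """Tabulate how many distinct words occur once, twice, etc. in the corpus.

    `documents` is an iterable of preprocessed strings (an assumption of this
    sketch); words are split on whitespace and compared case sensitively.
    """
    counts = Counter(word for doc in documents for word in doc.split())
    freqs = list(counts.values())
    return {
        "distinct": len(freqs),
        "once": sum(1 for c in freqs if c == 1),
        "twice": sum(1 for c in freqs if c == 2),
        "three": sum(1 for c in freqs if c == 3),
        "more_than_three": sum(1 for c in freqs if c > 3),
    }


# Toy corpus (hypothetical): "Sciant" and "sciant" count as different words.
docs = ["Sciant presentes et futuri", "sciant omnes et singuli et alii"]
table = word_frequency_table(docs)
```

The first four categories partition the vocabulary, which is why the corresponding four rows of Table 1 sum to the 50,006 distinct words.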

Finally, we exhibit here one of the DEEDS charters after preprocessing 
as indicated above. This document deals with the transfer of a messuage 
(house and appurtenances) in Nottingham for an annual payment of four 
shillings sterling. It bears serial number 00650032 in the DEEDS data set
and has been dated internally by regnal year to 1230-1231: 

Omnibus sancte matris ecclesie filiis ad quos presens scriptum pervenerit Simon 
abbas de Rufford' et conventus eiusdem loci salutem Noverit universitas vestra nos
dedisse concessisse quiete clamasse et hac presenti carta nostra confirmasse
Johanni filio Bele de Notingha' unum mesuagium cum pertinentiis in Notingha' quod
jacet inter terram Walteri Karkeney et terram Ade de Estweyt habend et tenend 
eidem Johanni et heredibus suis et heredibus eorum in feodo et hereditate de nobis 
vel atornatis nostris libere quiete integre pacifice et honorifice reddendo inde an- 
nuatim nobis vel atornatis nostris quatuor solidos sterlingorum ad duos terminos 
anni scilicet duos solidos ad Pentecosten et duos solidos ad festum sancti Martini 
pro omni servicio consuetudine seculari demanda et exactione Et nos predictam 
terram cum pertinentiis predicto Johanni et heredibus suis vel assignatis suis vel
heredibus eorum contra omnes homines warantizabimus sicut donatores nostri pre- 
dictam terram nobis warantizabunt Ut autem hec nostra donacio et concessio rata 
et stabilis imposterum permaneat hanc presentem cartam sigillo nostro roboravimus 
Hiis testibus Willelmo Brian Astino filio Alicie prepositis Burgi Anglico de Notinga' 
anno regni Regis Henrici filii Johannis Regis !xv! Henrico Kytte Henrico le Taylur 
Augustino clerico et aliis. 

3. Previous work. In this section we describe some previous work on the 
problem of calendaring undated English charters that had been carried out 
using the DEEDS data set. First, however, we define some basic terms and 
set out the notation that we will adhere to throughout. 

We will use $\mathcal{D}$ to denote a generic text document; $\mathcal{D}$ will frequently be
considered to be random, that is, a selection from an effectively infinite collection of
documents that could have arisen in the relevant random experiment. Our
data corpus will typically be denoted as $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_n$; our notation will
not distinguish whether these represent random documents or their actual
realizations, as this will always be clear from the context.

A document $\mathcal{D}$ consists of a string (ordered sequence) of not necessarily
distinct words $(w_1, w_2, \ldots, w_m)$, where $N(\mathcal{D}) = |\mathcal{D}| = m$ is the length
of the document. A shingle of size $k$, or $k$-shingle, is a substring
$s_k = (w_{j+1}, w_{j+2}, \ldots, w_{j+k})$ of $k$ consecutive words in $\mathcal{D}$; here $0 \le j \le m - k$ so
there are $m - k + 1$ (not necessarily distinct) $k$-shingles in $\mathcal{D}$. We will let
$s_k(\mathcal{D})$ denote the set of these (not necessarily distinct) $k$-shingles, while
$S_k(\mathcal{D})$ will denote the set of distinct $k$-shingles of $\mathcal{D}$. The cardinalities of
these sets satisfy $|S_k(\mathcal{D})| \le |s_k(\mathcal{D})| = m - k + 1$. When $k$ is considered to be
fixed, and given a $k$-shingle $s \in S_k(\mathcal{D})$, we will let $n_s(\mathcal{D})$ denote the number
of times this shingle occurs in $s_k(\mathcal{D})$. Finally, the date, $t$, of a document will
be denoted by $t(\mathcal{D}) = t$.
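
The shingle notation can be made concrete with a short sketch. It assumes a document is held as a Python list of words; the function names `shingles` and `shingle_counts` are hypothetical, and the word string is a fragment of the sample charter of Section 2.

```python
from collections import Counter


def shingles(words, k):
    """All k-shingles of a document, repeats included (the collection s_k(D))."""
    return [tuple(words[j:j + k]) for j in range(len(words) - k + 1)]


def shingle_counts(words, k):
    """Map each distinct k-shingle s in S_k(D) to its count n_s(D)."""
    return Counter(shingles(words, k))


# Fragment of the sample charter; m = 11 words.
words = "habend et tenend eidem Johanni et heredibus suis et heredibus eorum".split()
all_2 = shingles(words, 2)          # m - k + 1 = 10 shingles, with repeats
counts_2 = shingle_counts(words, 2)  # 9 distinct shingles
```

Here the 2-shingle ("et", "heredibus") occurs twice, so the number of distinct shingles is strictly smaller than $m - k + 1$.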

Turning now to previous work on the DEEDS data, Rodolfo Fiallos worked 
for the DEEDS project for many years, during which time he devised a 
method for dating the manuscripts called the MT method. See Fiallos (2000). 
MT stands for Multiplicador Total in Spanish and translates into English as 
"Total Multiplier." Fiallos' method is based on matching patterns — shingles 
of arbitrary length — which occur in the document we seek to date and which 
occur also in one or more of the documents in a training set of dated doc- 
uments. The underlying idea is that a relatively higher concentration of 
matching patterns should be found among those documents in the training 
set whose dates are closer to the unknown date of the document whose date
we are trying to estimate. Fiallos identified three characteristics of matching 
patterns thought to be important for the calendaring process: 

Length: The number of words in the matching pattern (shingle length). 
Lifetime: The difference, in years, between the last and first occurrence
of the matching pattern in the training set. (If a pattern occurs only within
one year, its Lifetime = 0.)

Currency: The Lifetime of the matching pattern divided by the number 
of distinct years in which it occurs. (Here we are following the definition of 
R. Fiallos: thus higher values of currency correspond to sparser occurrence 
of the pattern throughout the years of its lifetime.) 
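
The three characteristics can be computed directly from the years in which a pattern occurs. The sketch below follows the definitions just given; the function name `pattern_stats` and the list of years are hypothetical illustrations, not DEEDS data.

```python
def pattern_stats(pattern, years):
    """Length, Lifetime and Currency of a matching pattern.

    `pattern` is a tuple of words; `years` lists the dates of the training
    documents in which the pattern occurs (repeats allowed, not empty).
    """
    distinct_years = sorted(set(years))
    lifetime = distinct_years[-1] - distinct_years[0]
    # Currency as defined above: Lifetime divided by the number of distinct
    # years of occurrence, so sparser occurrence gives a larger value.
    currency = lifetime / len(distinct_years)
    return {"length": len(pattern), "lifetime": lifetime, "currency": currency}


# Hypothetical occurrence years for a three-word pattern.
stats = pattern_stats(("Francis", "et", "Anglicis"), [1150, 1150, 1160, 1190, 1200])
```

With these illustrative years the pattern has Length 3, Lifetime 50 and Currency 50/4 = 12.5.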

To date a given document $\mathcal{D}$, every substring of consecutive words in $\mathcal{D}$
is examined. [If $\mathcal{D}$ has length $m$, there will be $m + (m - 1) + \cdots + 2 + 1 =
m(m + 1)/2$ such substrings in all.] If such a substring occurs also in the
training set (it becomes a "matching pattern" and) it produces an MT value
defined as

$\mathrm{MT} = M_1(\mathrm{Length}) \times M_2(\mathrm{Lifetime}) \times M_3(\mathrm{Currency}).$

The larger its MT value, the more influential the matching pattern is
considered to be for the calendaring process. Here the function $M_1$ is increasing
since longer patterns are considered to be more informative; $M_2$ is
decreasing since patterns with longer lifetimes are viewed as being less informative;
and finally $M_3$ is also decreasing since sparser occurrence of a pattern within
its lifetime is thought to reduce its evidentiary worth. The functions $M_1$, $M_2$
and $M_3$ can be defined in many ad hoc ways, and such definitions invariably
entail many tuning-type parameters; such functions and their parameters
were determined by Fiallos through extensive trial and error and
leave-one-out cross-validation.

Once MT values have been assigned to all matching patterns in $\mathcal{D}$, an
MT value is computed for every year for which training data is available
by summing the MT values of all of the matching patterns of $\mathcal{D}$ that occur
among the training data of that year. However, in an attempt to reduce
noise, matching patterns whose MT values fall below a certain threshold
are excluded from this summing process. This procedure leads to a function
of time, called the MT function. To account for the fact that the number
of training documents varies over time, the values of this MT function are
each divided by the number of training documents in that year. These
standardized values are referred to as Global MT or GMT values. In principle,
the date having the highest GMT value is taken to be the estimated date
of $\mathcal{D}$. However, because such GMT functions are still quite noisy, the GMT
values are first averaged over time intervals of, say, 40 or 20 years, leading
to an estimated time interval for the date of $\mathcal{D}$. This estimated date range is
then expanded, and the GMT averaging process is then repeated over this
new range but now using a smaller interval width. This process is repeated
several times, leading finally to a point estimate for the unknown date.
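
The computation of GMT values can be sketched as follows. This is only an illustration under stated assumptions: the functions `m1`, `m2` and `m3` are simple placeholder choices with the required monotonicity (the forms tuned by Fiallos are not given here), distinct substrings of the target document are used, and the interval averaging and refinement steps are omitted. All names and the toy training set are hypothetical.

```python
from collections import defaultdict


def m1(length):       # assumed form: increasing in Length
    return float(length)

def m2(lifetime):     # assumed form: decreasing in Lifetime
    return 1.0 / (1.0 + lifetime)

def m3(currency):     # assumed form: decreasing in Currency
    return 1.0 / (1.0 + currency)


def substrings(words):
    """Distinct substrings of consecutive words, as tuples."""
    m = len(words)
    return {tuple(words[i:j]) for i in range(m) for j in range(i + 1, m + 1)}


def contains(doc_words, pattern):
    k = len(pattern)
    return any(tuple(doc_words[i:i + k]) == pattern
               for i in range(len(doc_words) - k + 1))


def gmt_by_year(target, training, threshold=0.0):
    """GMT value per training year; `training` is a list of (year, words)."""
    docs_per_year = defaultdict(int)
    for year, _ in training:
        docs_per_year[year] += 1
    totals = defaultdict(float)
    for pattern in substrings(target):
        years = sorted({y for y, w in training if contains(w, pattern)})
        if not years:
            continue  # not a matching pattern
        lifetime = years[-1] - years[0]
        mt = m1(len(pattern)) * m2(lifetime) * m3(lifetime / len(years))
        if mt < threshold:
            continue  # noise reduction: drop weak patterns
        for y in years:
            totals[y] += mt
    # Standardize by the number of training documents in each year.
    return {y: totals[y] / n for y, n in docs_per_year.items()}


training = [
    (1200, "sciant presentes et futuri".split()),
    (1200, "noverit universitas vestra".split()),
    (1250, "omnibus ad quos presens scriptum".split()),
]
target = "sciant presentes et futuri quod ego".split()
gmt = gmt_by_year(target, training)
best_year = max(gmt, key=gmt.get)
```

In this toy example every matching pattern occurs only in 1200, so the unaveraged GMT function peaks at that year.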

Figure 3 (based on computations provided by Fiallos) plots the estimated
versus the actual dates for 1484 DEEDS documents which were dated by
the MT method. These 1484 documents were randomly selected from a set
of approximately 3500 dated documents, and each of these 1484 selected
documents was then dated on the basis of the full 3500 documents data set,
but with the one being dated left out. The mean absolute error (MAE) was
found to be 16 years. The heavy concentration of points occurring near the
"x = y" axis is due to documents that have been dated rather accurately. We
remark, however, that the MAE estimate of 16 years is likely to be optimistic
because it was not based on a held-out test set; that is, the optimization of
the many tuning parameters was performed over the same data set.

Fig. 3. Estimated versus true dates for 1484 documents, dated by the method of R.
Fiallos, each selected randomly from a training set of approximately 3500 dated documents.

4. Calendaring by nearest neighbors (kNN). Distance based methods 
for calendaring charters (also referred to as nearest neighbor or kNN meth- 
ods) were introduced in Feuerverger et al. (2005, 2008), hereafter referred to 
as FHTG (2005) and FHTG (2008). The underlying idea is to define mea- 
sures of distance between pairs of documents and to estimate the date of 
an undated document by a weighted average of the dates of documents in a 
training set using weights which depend on their distances to the document 
we seek to date. Alternately, one can use a reciprocal to the concept of dis- 
tance, namely, similarity (also referred to as resemblance or correspondence), 
and average over the dates of documents in the training set using weights 
based on the similarity measures. For completeness and later comparisons, 
we outline these methods in this section. 

Measures of distance and similarity: Distance and similarity measures on 
documents are discussed, for example, in Djeraba (2003), FHTG (2005), 



DATING MEDIEVAL ENGLISH CHARTERS 






McGill, Koll and Noreault (1979), Quang et al. (1999), Salton, Wang and 
Yang (1975), Tan, Steinbach and Kumar (2005), Zhang and Korfhagen 
(1999) and references therein. Let $\mathcal{P}$ and $\mathcal{Q}$ represent two documents whose
union consists of $|\mathcal{P} \cup \mathcal{Q}| = l$ distinct words. (A discussion based on $k$-shingles
would be analogous.) Let $p = (p_1, \ldots, p_l)$ and $q = (q_1, \ldots, q_l)$, respectively,
be vectors corresponding to the occurrence of these distinct words; these
vectors can variously be word counts, normalized counts ($\sum_i p_i = \sum_i q_i = 1$)
or 0-1 incidence vectors. Then some natural measures of similarity between
$\mathcal{P}$ and $\mathcal{Q}$ are given by

(4.1)  $\mathrm{Sim}_\gamma(\mathcal{P},\mathcal{Q}) = \dfrac{\sum_{i=1}^{l} p_i^{\gamma} q_i^{\gamma}}{\sqrt{\sum_{i=1}^{l} p_i^{2\gamma}} \, \sqrt{\sum_{i=1}^{l} q_i^{2\gamma}}}$

for $0 < \gamma < \infty$. The case $\gamma = 1$ corresponds to the angle-based cosine
similarity, while the case $\gamma = 1/2$ with normalized $p$ and $q$ results in a similarity
measure that leads to a Hellinger distance. Similarity measures somewhat
alike to (4.1) may also be defined as



(4.2) Sim„(7',Q)- ^^^^^^ 



for $0 < \alpha < \infty$. Unlike (4.1), these have the advantage that, for all such
values of $\alpha$,

(4.3)  $\mathrm{Dist}_\alpha(\mathcal{P},\mathcal{Q}) = 1 - \mathrm{Sim}_\alpha(\mathcal{P},\mathcal{Q})$

is a proper metric (i.e., satisfies the triangle inequality).

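Broder (1998) defined the resemblance of two documents $\mathcal{D}_1$ and $\mathcal{D}_2$, for
a given (fixed) shingle size $k$, as

(4.4)  $\mathrm{Res}_k(\mathcal{D}_1,\mathcal{D}_2) = \dfrac{|S_k(\mathcal{D}_1) \cap S_k(\mathcal{D}_2)|}{|S_k(\mathcal{D}_1) \cup S_k(\mathcal{D}_2)|}.$

Using this definition, a set-based resemblance distance between documents
which satisfies the triangle inequality may be defined as

$\mathrm{Dist}_k(\mathcal{D}_1,\mathcal{D}_2) = 1 - \mathrm{Res}_k(\mathcal{D}_1,\mathcal{D}_2).$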

There are, of course, many other measures of distance and similarity. We
remark that for information retrieval work, many distance measures often 
behave similarly and that whether or not the triangle inequality holds tends 
to be inconsequential. [See, e.g., Djeraba (2003), Chapter 4.] One poten- 
tial benefit, however, of having many versions of distance is in permitting 
the implementation of ensemble-type estimation methods. The use of
multidimensional scaling as an alternative to incorporate distances based on
similarities is also worth mentioning, but lies outside the scope of this pa- 
per. 
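
Two of the measures above are easy to state in code. The following sketch computes the cosine similarity of word-count vectors (the case described above as angle-based) and the resemblance distance built from distinct $k$-shingles. Documents are assumed to be lists of words, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(p_words, q_words):
    """Cosine similarity between the word-count vectors of two documents."""
    p, q = Counter(p_words), Counter(q_words)
    vocab = set(p) | set(q)
    dot = sum(p[w] * q[w] for w in vocab)
    norm = math.sqrt(sum(c * c for c in p.values())) * \
        math.sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


def distinct_shingles(words, k):
    return {tuple(words[j:j + k]) for j in range(len(words) - k + 1)}


def resemblance_distance(p_words, q_words, k):
    """One minus the resemblance: shared distinct k-shingles over their union."""
    sp, sq = distinct_shingles(p_words, k), distinct_shingles(q_words, k)
    union = sp | sq
    if not union:
        return 0.0  # convention of this sketch when both documents are shorter than k
    return 1.0 - len(sp & sq) / len(union)


a = "sciant presentes et futuri".split()
b = "sciant presentes et futuri quod ego".split()
cos_ab = cosine_similarity(a, b)
dist_ab = resemblance_distance(a, b, 2)
```

For these two toy documents the three 2-shingles of the first all occur among the five 2-shingles of the second, giving a resemblance of 3/5 and hence a distance of 0.4.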

Calendaring by kNN methods: To develop and evaluate distance based and
other estimation methods, the DEEDS documents were first partitioned at
random into a training set $T$, a validation set $V$ and a test set $A$. We will
frequently interchange notation such as $\mathcal{D}_i \in T$ and $i \in T$ for membership
in these sets. Our aim is to estimate the unknown date $t_i$ of a document $\mathcal{D}_i$,
when $i \notin T$. Here we follow FHTG (2005, 2008).

Let $d_k(i,j)$, for $k = 1, 2, \ldots, r$, denote $r$ different distance measures
between documents $\mathcal{D}_i$ and $\mathcal{D}_j$, say. For instance, these distances could all be
Broder distances corresponding to different shingle lengths $k$, with $r$ being
the largest shingle size in the procedure. Using these distances, we define
an $r$-dimensional kernel weight on the dates $t_j$ of the documents $\mathcal{D}_j$ in the
training set $T$:

(4.5)  $a(i,j) = a(i,j \mid h_1, \ldots, h_r) = \prod_{k=1}^{r} K_{h_k}(d_k(i,j)),$

where $i$ corresponds to the document $\mathcal{D}_i$ we seek to date. Here $K_h(\cdot)$ is a
nonnegative, nonincreasing function defined on the positive half-line and $h$ is a
bandwidth parameter. For example, we could take $K_h(u) \propto \exp\{-(u/h)^{\eta}\}$,
or $K_h(u) \propto (1 + (u/h)^{\eta})^{-1}$ for some choice of $\eta$, with each distance
measure permitted to have its own bandwidth. The distance based (or kNN)
estimator for the date $t_i$ of $\mathcal{D}_i$ is then defined as

(4.6)  $\hat t = \hat t_i = \arg\min_t \sum_{j \in T} (t_j - t)^2 a(i,j) = \dfrac{\sum_{j \in T} t_j \, a(i,j)}{\sum_{j \in T} a(i,j)}.$
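
A minimal sketch of the distance based estimator follows. The weighted least squares problem is solved by the weighted mean of the training dates, with weights given by a product of kernels, one per distance measure. The exponential kernel with exponent `eta = 2.0`, the function names and the toy numbers are assumptions made for illustration only.

```python
import math


def kernel(u, h, eta=2.0):
    """Nonnegative, nonincreasing kernel on u >= 0 (assumed exponential form)."""
    return math.exp(-((u / h) ** eta))


def knn_date(distances, dates, bandwidths):
    """Kernel-weighted average of training dates.

    distances[j] holds the r distances from the document being dated to
    training document j; dates[j] is its known date; bandwidths has length r.
    """
    weights = [
        math.prod(kernel(d, h) for d, h in zip(dist_j, bandwidths))
        for dist_j in distances
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights vanished; use larger bandwidths")
    return sum(w * t for w, t in zip(weights, dates)) / total


# Toy example with r = 2 distance measures and three training documents.
distances = [[0.2, 0.3], [0.2, 0.3], [0.9, 0.95]]
dates = [1200, 1220, 1300]
estimate = knn_date(distances, dates, bandwidths=[0.1, 0.1])
```

The two close documents receive equal weight and the distant one a negligible weight, so the estimate is very nearly 1210.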



It remains to consider the selection of the bandwidths $h_1, \ldots, h_r$ in (4.5).
In FHTG (2005, 2008) this was based on a form of cross-validation which is
local in the sense that it tries to determine the set of bandwidths optimal
for each document $\mathcal{D}_i$ individually. Specifically, let $\mathcal{K}(i)$ be the collection of
nearest neighbors to $\mathcal{D}_i$, defined as the union, over all $1 \le k \le r$, of the set
of all indices $j \in T$ in the training set such that $d_k(i,j)$ is among the $m$
smallest values of that quantity, where the integer $m$ is some small fraction
of the total number of documents in $T$. Then $m$, as well as the $h_1, \ldots, h_r$
specific to $\mathcal{D}_i$, are chosen to minimize the cross-validation function

(4.7)  $\mathrm{CV}(m; h_1, \ldots, h_r) = \dfrac{1}{|\mathcal{K}(i)|} \sum_{j' \in \mathcal{K}(i)} (t_{j'} - \hat t_{-j'})^2,$

where

$\hat t_{-j'} = \hat t_{-j'}(m; h_1, \ldots, h_r) = \arg\min_t \sum_{j \ne j'} (t_j - t)^2 a(j', j).$

While this bandwidth selection process is local in the sense that for each
$\mathcal{D}_i$, it tries to determine a set of bandwidths by optimizing over its nearest
neighbors $\mathcal{K}(i)$, if we were to choose all $\mathcal{K}(i) = T$ the procedure would
become global with the estimated bandwidths then being the same for all of
the documents.

The optimization over $m$ and $h_1, \ldots, h_r$ is carried out via a grid search
resulting in

$(\hat m; \hat h_1, \ldots, \hat h_r) = \arg\min \mathrm{CV}(m; h_1, \ldots, h_r).$

The mean squared error of the date estimate $\hat t_i$ can then be estimated as

$\widehat{\mathrm{MSE}}(\hat t_i) = \dfrac{\sum_{j' \in \mathcal{K}(i)} (t_{j'} - \hat t_{-j'})^2 \, a(i,j' \mid \hat h_1, \ldots, \hat h_r)}{\sum_{j' \in \mathcal{K}(i)} a(i,j' \mid \hat h_1, \ldots, \hat h_r)},$

where the $\hat t_{-j'}$, for all $j' \in \mathcal{K}(i)$, are computed using the same bandwidths
as for $\hat t_i$.

5. Calendaring by maximum prevalence (MP). Our method of maximum
prevalence (MP) for calendaring a document $\mathcal{D}$ is an analogue of the
method of maximum likelihood; it attempts to assign, for each point $t$ in
time, a probability for the occurrence of $\mathcal{D}$ at that time, and it estimates
the unknown date of $\mathcal{D}$ by that value of $t$ at which $\mathcal{D}$ has the highest
probability of occurrence. The MP method is specific to a given shingle size, say,
$k$, but the ensemble of estimates produced using different values of $k$ can
subsequently be combined.

If now $\mathcal{D}$ consists of a string of $N$ words, it will contain $|s_k(\mathcal{D})| = N - k + 1$
(not necessarily unique) $k$-shingles. We will let $N(\mathcal{D}) = |s_k(\mathcal{D})|$ represent the
number of elements in $s_k(\mathcal{D})$, suppressing its dependence on $k$. The
assumption is then made that these $N(\mathcal{D})$ shingles occur independently of each other
and are drawn from the multivariate distribution over shingles of size $k$ in
effect at the true date $t(\mathcal{D})$ of the document. Although this assumption,
made here of necessity, is untrue, there are some arguments in its favor.
In particular, in some statistical problems, estimators can remain consistent
(and even asymptotically efficient) when dependency is ignored. Examples
include incorrectly assuming independence when estimating the mean of
certain stationary processes. In such cases, it is primarily the variances of the
estimates that are affected. Additional arguments are given in Domingos
and Pazzani (1996).

Suppose then that for every possible $k$-shingle $s$, we knew the probability
$\pi_s(t)$ of its occurrence at every time point $t$. Then the prevalence function
for $\mathcal{D}$ is defined as

(5.1)  $\pi_{\mathcal{D}}(t) = \prod_{s \in s_k(\mathcal{D})} \pi_s(t),$



14 



G. TILAHUN, A. FEUERVERGER AND M. GERVERS 



and by analogy with maximum likelihood, the true date $t(\mathcal{D})$ of $\mathcal{D}$ would
be estimated as that value of $t$ at which $\pi_{\mathcal{D}}(t)$ is maximized. The function
$\pi_{\mathcal{D}}(t)$ is intended to represent the probability of the occurrence of
$\mathcal{D}$ as a function over time. Of course, we do not know the $\pi_s(t)$, but these
may be estimated, as $\hat{\pi}_s(t)$, say, leading to an estimated prevalence function

(5.2) $\hat{\pi}_{\mathcal{D}}(t) = \prod_{s \in s_k(\mathcal{D})} \hat{\pi}_s(t),$

and finally to our proposed date estimator

$\hat{t}_{\mathcal{D}} = \arg\max_t \hat{\pi}_{\mathcal{D}}(t).$
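
The date estimator above can be sketched in a few lines of Python. The sketch works on the log scale, so that the product of shingle probabilities becomes a sum, and it assumes that estimated log-probabilities are already available in a lookup table. The container `log_prob`, the function names and the constant `floor` used for shingles absent from the table are hypothetical choices made for illustration; the source does not say how such shingles are handled.

```python
from collections import Counter

def shingles(words, k):
    """All N - k + 1 consecutive k-word shingles of a document, with repeats."""
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def mp_date(words, k, log_prob, dates, floor=-20.0):
    """Maximum prevalence date: the t maximising sum_s log pi_hat_s(t).

    log_prob maps a shingle to a dict {date: estimated log-probability}.
    Shingles or dates missing from log_prob receive the assumed constant
    `floor`, a placeholder rather than part of the method described above.
    """
    counts = Counter(shingles(words, k))

    def log_prevalence(t):
        return sum(n * log_prob.get(s, {}).get(t, floor)
                   for s, n in counts.items())

    return max(dates, key=log_prevalence)
```

Repeated shingles enter the sum once per occurrence, which matches taking the product over all (not necessarily unique) shingles of the document.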

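We must now consider how to estimate the probabilities $\pi_s(t)$ of shingle
occurrence. Given a document $\mathcal{D}$ and a $k$-shingle $s$, the number of times $s$
occurs in $\mathcal{D}$ will be denoted by $n_s(\mathcal{D})$. For $n_s(\mathcal{D})$ we postulate the
binomial model

$\mathcal{L}(n_s(\mathcal{D}) \mid N(\mathcal{D}) = N,\ t(\mathcal{D}) = t) \sim \mathrm{Bin}(N, \pi_s(t)),$

according to which the probability of the observed value $n_s(\mathcal{D})$ is

$\binom{N(\mathcal{D})}{n_s(\mathcal{D})} \, \pi_s(t)^{n_s(\mathcal{D})} \, \{1 - \pi_s(t)\}^{N(\mathcal{D}) - n_s(\mathcal{D})};$

here $t(\mathcal{D}) = t$ is the date of $\mathcal{D}$ and $N(\mathcal{D}) = N$ is the number of
$k$-shingles it contains. In terms of the canonical log-odds parameter

$\lambda_s(t) = \log \frac{\pi_s(t)}{1 - \pi_s(t)},$

the logarithm of this probability is

$\log \binom{N(\mathcal{D})}{n_s(\mathcal{D})} + n_s(\mathcal{D}) \lambda_s(t) - N(\mathcal{D}) \log[1 + \exp\{\lambda_s(t)\}].$

Because the first (combinatorial) term here does not depend on $\lambda_s(t)$, we
drop it from subsequent expressions. Hence, given a random sample of documents
$\mathcal{D}_i \in \mathcal{T}$, with corresponding dates $t_i$, the log-likelihood function in
the parameter $\lambda_s(\cdot)$ is taken to be

$\sum_{i \in \mathcal{T}} \{ n_s(\mathcal{D}_i) \lambda_s(t_i) - N(\mathcal{D}_i) \log[1 + \exp\{\lambda_s(t_i)\}] \}.$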

We next model the function parameter $\lambda_s(\cdot)$ as a $t$-local polynomial of
degree $p$; specifically, for $u$ near $t$,

(5.3) $\lambda_s(u) \approx \beta_0 + \beta_1 (u - t) + \cdots + \beta_p (u - t)^p.$

Here the dependence of $\lambda_s(\cdot)$ as well as of the $\beta_0, \ldots, \beta_p$ on $t$ has
been suppressed. [See, e.g., Loader (1999).] Finally, we introduce a $t$-localized
version of the log-likelihood, namely,

(5.4) $\sum_{i \in \mathcal{T}} \{ n_s(\mathcal{D}_i) \lambda_s(t_i) - N(\mathcal{D}_i) \log[1 + \exp\{\lambda_s(t_i)\}] \} \, K_h(t_i - t),$

which is to be maximized over the $\beta_0, \ldots, \beta_p$ for every given $t$. The resulting
estimate $\hat{\beta}_0$ for $\beta_0$ is then taken as our estimate for $\lambda_s(t)$. Here $K_h(u)$ is
a symmetric weight function which takes on its maximum at $u = 0$, and is
nonincreasing as $u$ moves away from the origin. A Gaussian version might
be $K_h(u) \propto \exp\{-u^2/(2h^2)\}$, with $h$ corresponding to its standard deviation.
More flexibly, we could write $K_{h,\nu}(u)$ in place of $K_h(u)$, with

$K_{h,\nu}(u) \propto \left(1 + \frac{u^2}{\nu h^2}\right)^{-(\nu+1)/2},$

corresponding to a $t$-distribution with $\nu$ degrees of freedom; this allows for
a tail-weight parameter in addition to a scaling.

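If we take the polynomial (5.3) to be of degree $p = 0$, so that $\lambda_s(u) = \beta_0$
there, and then set the derivative with respect to $\beta_0$ in (5.4) to zero, we
obtain (in terms of $\pi_s$) the solution

(5.5) $\hat{\pi}_s(t) = \frac{\sum_{i \in \mathcal{T}} n_s(\mathcal{D}_i) K_h(t_i - t)}{\sum_{i \in \mathcal{T}} N(\mathcal{D}_i) K_h(t_i - t)},$

which is analogous to the estimator of Nadaraya (1964) and Watson (1964).
If instead we use a polynomial of degree $p = 1$ in (5.3) (locally linear smoothing)
and set derivatives with respect to $\beta_0$ and $\beta_1$ in (5.4) to zero, we obtain
the pair of equations

(5.6) $\sum_{i \in \mathcal{T}} n_s(\mathcal{D}_i) K_h(t_i - t) = \sum_{i \in \mathcal{T}} \frac{N(\mathcal{D}_i) \exp\{\beta_0 + \beta_1 (t_i - t)\}}{1 + \exp\{\beta_0 + \beta_1 (t_i - t)\}} K_h(t_i - t)$

and

(5.7) $\sum_{i \in \mathcal{T}} n_s(\mathcal{D}_i) (t_i - t) K_h(t_i - t) = \sum_{i \in \mathcal{T}} \frac{N(\mathcal{D}_i) \exp\{\beta_0 + \beta_1 (t_i - t)\}}{1 + \exp\{\beta_0 + \beta_1 (t_i - t)\}} (t_i - t) K_h(t_i - t).$

These must be solved numerically for $\beta_0$ and $\beta_1$ at every $t$, giving $\hat{\beta}_0$ and
$\hat{\beta}_1$, and we would then take

$\hat{\pi}_s(t) = \frac{\exp(\hat{\beta}_0)}{1 + \exp(\hat{\beta}_0)}.$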
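
The two estimators just described can be sketched in Python as follows. The first function is the locally constant, count-ratio estimate; the second maximizes the localized binomial log-likelihood by Newton's method for a local polynomial of any small degree and returns the inverse logit of the fitted intercept. This is a sketch under assumptions: the function names, the starting value (the pooled log-odds, which requires the shingle to occur at least once but not always), the iteration limit and the crude step clipping are illustrative choices, not details from the source.

```python
import numpy as np

def t_kernel(u, h, nu):
    """Unnormalised t-distribution kernel with scale h and nu degrees of freedom."""
    return (1.0 + (u / h) ** 2 / nu) ** (-(nu + 1.0) / 2.0)

def nw_probability(t, dates, n_s, n_total, h, nu):
    """Degree-0 estimate: kernel-weighted shingle count over weighted total count."""
    w = t_kernel(dates - t, h, nu)
    return float(np.sum(w * n_s) / np.sum(w * n_total))

def local_logit_probability(t, dates, n_s, n_total, h, nu, degree=1, iters=50):
    """Local polynomial binomial fit at t by Newton's method; returns expit(beta_0)."""
    w = t_kernel(dates - t, h, nu)
    X = np.vander(dates - t, degree + 1, increasing=True)  # columns 1, (t_i - t), ...
    beta = np.zeros(degree + 1)
    p0 = np.sum(n_s) / np.sum(n_total)
    beta[0] = np.log(p0 / (1.0 - p0))                      # start at pooled log-odds
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (n_s - n_total * p))             # weighted score equations
        hess = X.T @ (X * (w * n_total * p * (1.0 - p))[:, None])
        step = np.clip(np.linalg.solve(hess, grad), -1.0, 1.0)  # crude damping (assumed)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return float(1.0 / (1.0 + np.exp(-beta[0])))
```

With `degree=0` the Newton iteration converges to the same value as `nw_probability`, which gives a simple consistency check between the two routines.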

We remark that we could alternatively have modeled the data using a
Poisson distribution as in

$n_s(\mathcal{D}) \sim \mathrm{Poisson}(\mu_s(t) N(\mathcal{D}))$

and carried out local polynomial fitting using the canonical link parameter
$\lambda_s(t) = \log \mu_s(t)$. [Here we have used $\mu_s(t)$ in place of $\pi_s(t)$ for the
shingle's probabilities.] If the local polynomial is taken to be of degree 0, this
leads again to the Nadaraya-Watson type solution (5.5), with $\hat{\mu}_s(t) = \hat{\pi}_s(t)$. For
local polynomials of degree greater than 0, the solutions are approximately,
but not exactly, equivalent to the binomial case. Note that due to their
exponential family nature, the Hessians associated with these models are
strictly negative definite; hence, these various equations are well-behaved
and have unique solutions.
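
The degree-0 claim for the Poisson model can be checked directly. The short derivation below is a sketch that uses only the notation already introduced; terms of the Poisson log-likelihood that do not involve $\beta_0$ (namely $n_s(\mathcal{D}_i) \log N(\mathcal{D}_i)$ and $\log n_s(\mathcal{D}_i)!$) are dropped, in the same way as the combinatorial term in the binomial case.

```latex
% Degree-0 local Poisson fit: \lambda_s(u) = \beta_0 near t, so \mu_s = e^{\beta_0}.
\begin{align*}
\ell_t(\beta_0)
  &= \sum_{i \in \mathcal{T}} \bigl\{ n_s(\mathcal{D}_i)\,\beta_0
     - N(\mathcal{D}_i)\,e^{\beta_0} \bigr\}\, K_h(t_i - t), \\
\frac{\partial \ell_t}{\partial \beta_0}
  &= \sum_{i \in \mathcal{T}} \bigl\{ n_s(\mathcal{D}_i)
     - N(\mathcal{D}_i)\,e^{\beta_0} \bigr\}\, K_h(t_i - t) = 0, \\
\hat{\mu}_s(t) = e^{\hat{\beta}_0}
  &= \frac{\sum_{i \in \mathcal{T}} n_s(\mathcal{D}_i)\, K_h(t_i - t)}
          {\sum_{i \in \mathcal{T}} N(\mathcal{D}_i)\, K_h(t_i - t)} .
\end{align*}
```

The last line is the same weighted count ratio as in (5.5).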

As a final remark, we mention that one may consider replacing the definition
of the prevalence function in (5.1) by something like

(5.8) $\pi'_{\mathcal{D}}(t) = \prod_{s \in s_k(\mathcal{D})} \pi_s(t) \prod_{s \notin s_k(\mathcal{D})} [1 - \pi_s(t)],$

with a corresponding change in its empirical version (5.2), so as to try to
take into better account shingles that did not occur in the document being
dated. However, the logarithm of the second factor in (5.8) is

(5.9) $\sum_{s \notin s_k(\mathcal{D})} \log\{1 - \pi_s(t)\} \approx -\sum_{s \notin s_k(\mathcal{D})} \pi_s(t) \approx -\sum_{s} \pi_s(t) = -1,$

since each $\pi_s(t)$ is small, and because the total number of possible shingles
far exceeds those in any given document. We computed empirical versions
of the logarithm of the second factor in (5.8) and invariably found that such
curves stayed close to $-1$, and were therefore not informative.
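
The approximation in (5.9) is easy to check numerically. The Python sketch below uses a hypothetical Zipf-like set of shingle probabilities over a large vocabulary and a hypothetical document containing 100 moderately rare shingles; both are assumptions made for illustration and are not estimates from the DEEDS data.

```python
import numpy as np

def log_second_factor(pi, in_doc):
    """Sum of log(1 - pi_s) over shingles s that do not occur in the document."""
    return float(np.sum(np.log1p(-pi[~in_doc])))

# Hypothetical Zipf-like probabilities over V possible shingles (an assumption).
V = 200_000
ranks = np.arange(1, V + 1)
pi = (1.0 / ranks) / np.sum(1.0 / ranks)

# Hypothetical document: 100 shingles taken from a moderately rare rank range.
in_doc = np.zeros(V, dtype=bool)
in_doc[1000:1100] = True

second_factor_log = log_second_factor(pi, in_doc)
```

Under these assumptions the value lies close to $-1$. The approximation would be less accurate for a document whose shingles carry a large share of the total probability mass.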

6. Calendaring via quantile regression (QR). A third proposal for the
calendaring problem is based on quantile regression as follows. Suppose that
$\mathcal{D}$ is a document whose date we wish to estimate. A scatterplot is produced
of the distances $\mathrm{Dist}(\mathcal{D}, \mathcal{D}_i)$ from $\mathcal{D}$ to each of the documents
$\mathcal{D}_i \in \mathcal{T}$ in a training set, against the known dates $t(\mathcal{D}_i)$ of those
training set documents. A nonparametric quantile regression (QR) curve is then
fit to this scatterplot, and the date at which this QR curve attains its minimum
value is taken as the estimate of the date of $\mathcal{D}$. QR algorithms typically have
two parameters: a bandwidth $h$ which controls the smoothness of the curve and
a quantile $0 < q < 1$. (The bandwidth parameter need not be kept constant
over the range of dates and may be larger in regions of sparser date ranges.)
The parameters $h$ and $q$ are meant to be optimized for documents in a
validation set which are dated using data in a training set. The procedure is
then assessed on the documents in a held-out test set. Figure 8 in Section 8
below illustrates the QR procedure in action. For quantile regression, our
key reference is Koenker (2005).
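
The QR procedure can be illustrated with a minimal Python sketch. It uses a locally constant, Gaussian-weighted sample quantile as a simple stand-in for a nonparametric quantile regression fit. This substitution, the function names `weighted_quantile` and `qr_date`, and the evaluation grid are assumptions made for illustration and are not taken from the source.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """q-th quantile of `values` under nonnegative `weights`."""
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    idx = np.searchsorted(cum, q * cum[-1])       # first index reaching mass q
    return values[order][min(idx, len(values) - 1)]

def qr_date(distances, dates, grid, h, q):
    """Date on `grid` at which the kernel-smoothed q-quantile curve of the
    distances is smallest, together with the curve itself."""
    curve = []
    for t in grid:
        w = np.exp(-0.5 * ((dates - t) / h) ** 2)  # Gaussian weights in time
        curve.append(weighted_quantile(distances, w, q))
    curve = np.asarray(curve)
    return grid[int(np.argmin(curve))], curve
```

A low quantile such as $q = 0.1$ tracks the lower envelope of the scatterplot, so the curve is smallest near dates whose training documents lie closest to the document being dated.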
7. Theoretical considerations. In this section we discuss some general
considerations concerning the consistency of the estimates proposed in
Sections 4 and 5.

Turning first to the distance-based (kNN) method, we have the following
result: Let $\mathcal{D}_0$ be an undated document written at time $t_0$, and denote by
$\mathcal{D}$ a dated document, written at time $T$, and chosen at random from a
potentially infinite (but representative) training set and having a random
distance $\Delta$ from $\mathcal{D}_0$. (For simplicity, we assume that our kNN procedure is
based on only a single distance measure, but the general case is similar.) We
posit five conditions:

(i) Asymptotic unbiasedness: The conditional expectation of the mean
of $T$ converges to $t_0$ over neighborhoods $\Delta \to 0$.

(ii) Bounded variance: The second moment of $T$ remains bounded as
these distance neighborhoods shrink to 0.

(iii) A technical condition: $\Delta$ can be viewed as possessing a density at
the origin which is continuous and positive.

(iv) The kernel $K(u)$ is bounded, continuous, compactly supported and
nonincreasing on the positive real line, with $K(0) > 0$.

(v) The number of elements in the training set increases sufficiently
quickly as the bandwidth $h$ tends to 0.

Under the conditions (i)-(v), it was proved in FHTG (2008) that the
estimator $\hat{t}$ defined at (4.6) is a consistent estimator of the true date $t_0$ of the
document $\mathcal{D}_0$; that is, $\hat{t} \to_p t_0$ as the size of the training set tends to infinity.

Turning to the MP method, consistency results may be established along
the following lines. Assume time to be integer valued and restricted to a
compact domain: $t_{\min} \le t \le t_{\max}$. We again let $\mathcal{D}_0$ denote the document to
be dated, and $t_0$ is its unknown true date. We consider our training set to
be an increasing ($n \to \infty$) sequence of documents
$\mathcal{T}_n = \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_n\}$ in which the random documents $\mathcal{D}_j$, and their
corresponding random dates $T_j$, are viewed as being an i.i.d. sample from an
essentially infinite population generically represented by the random object
$(\mathcal{D}, T)$. The set of all shingles possible at any point within our time interval
will be denoted by $S$. (The shingle size is considered fixed.) Every shingle
$s \in S$ has associated with it its probabilities $\pi_s(t)$ of being drawn at any of
the time points $t$. Note that for each $t$ we will have $\sum_{s \in S} \pi_s(t) = 1$. We
next assume that if $t_1 \ne t_2$, then the collections $\{\pi_s(t_1) : s \in S\}$ and
$\{\pi_s(t_2) : s \in S\}$ are not identical; specifically, $\pi_s(t_1) \ne \pi_s(t_2)$ for some
$s \in S$.

In the random object $(\mathcal{D}, T)$, we will assume that, conditionally on $T = t$,
the sequence of shingles comprising $\mathcal{D}$ is an i.i.d. sample drawn from $S$ under
the probability distribution $\{\pi_s(t) : s \in S\}$. In particular, the shingles of
$\mathcal{D}_0$ are assumed to be randomly drawn from $S$ using the distribution $\pi_s(t_0)$. In
$(\mathcal{D}, T)$ the length of $\mathcal{D}$ is assumed to be independent of $T$.

Now, for each $s \in S$, under standard conditions for the Nadaraya-Watson
estimator, we will have

(7.1) $\sup_{t_{\min} \le t \le t_{\max}} |\hat{\pi}_s(t) - \pi_s(t)| \to 0$ as $n \to \infty$,

so that for a $\mathcal{D}_0$ of fixed, finite length, we will have

(7.2) $\sup_{t_{\min} \le t \le t_{\max}} |\hat{\pi}_{\mathcal{D}_0}(t) - \pi_{\mathcal{D}_0}(t)| \to 0$ as $n \to \infty$.

On the other hand, the standard argument (based on the Law of Large
Numbers and Kullback-Leibler distance) which is used to prove consistency
of the MLE in the case when the parameter space is finite applies equally
here and allows us to conclude that with arbitrarily high probability, $\pi_{\mathcal{D}_0}(t)$
will take on its maximum value uniquely at $t_0$ provided only that $|\mathcal{D}_0|$ is
sufficiently large. Hence, by requiring $|\mathcal{D}_0|$ to be sufficiently large, and then
letting $n \to \infty$, the estimated date $\hat{t}$ of $\mathcal{D}_0$ can be made to equal $t_0$ with
arbitrarily high probability.

Of course, asymptotic results do need to be assessed for relevance in any 
specific application. In particular, it must be borne in mind that any docu- 
ment to be dated will be of finite length and so will necessarily contain only 
limited "Fisher information" for the estimation of its date parameter. 

8. Numerical work. In this section we describe some numerical experi- 
ments which we conducted using the kNN and MP estimation methods with 
the DEEDS data set. This work was carried out using a combination of 
UNIX commands together with the C programming language, as well as the 
R statistical computing package. 

For the purposes of our experiments, we first randomly partitioned the
3353 DEEDS documents which were available to us into a training set $\mathcal{T}$,
a validation set $\mathcal{V}$ and a test set $\mathcal{A}$, with these sets having cardinalities
$|\mathcal{T}| = 2608$, $|\mathcal{V}| = 419$, and $|\mathcal{A}| = 326$. Unlike the MP method, however, our
experiments with the kNN method as described in Section 4 did not require
a validation set because in that method the parameters for dating any given
document are determined solely from its neighbors within the training set,
as well as from other members of the training set. Therefore, for our kNN
numerical work, $\mathcal{V}$ and $\mathcal{A}$ were combined to form a larger test set consisting
of 745 documents.

Our experiments with the kNN method were based on shingle sizes 1, 2
and 3, as well as on all combinations of these sizes. We used the distance
(4.3) with a = 1, based on the similarity (4.2), and therefore a proper metric;
this distance was computed using argument vectors "p" and "q" consisting
of raw (i.e., unnormalized) shingle counts. These distances are denoted as
$d_k(i,j)$, with $k$ representing the shingle sizes on which they are based.

Table 2
Performance of the kNN and MP methods on the DEEDS data set

Dating    Shingle      Optimal            √MSE           MAE            MedAE
method    lengths      parameters         (val., test)   (val., test)   (val., test)

M1        1            h = 8, df = 5      18.3, 19.8     11.7, 12.5     7.0, 8.0
M2        2            h = 12, df = 3     14.8, 14.7     9.5, 9.0       6.0, 6.0
M3        3            h = 12, df = 5     17.0, 15.4     10.1, 9.5      6.0, 6.0
M4        4            h = 16, df = 12    18.8, 22.8     11.5, 12.4     7.0, 7.0
M1234     1-4                             14.3, 14.5     9.3, 9.2       6.0, 6.0
kNN1      1            m = 1000           20.1           12.3           6.4
kNN2      2            m = 500            23.7           13.8           6.4
kNN3      3            m = 500            28.3           16.6           7.6
kNN12     1 & 2        m = 100            20.2           12.1           6.3
kNN13     1 & 3        m = 100            21.7           12.9           7.0
kNN23     2 & 3        m = 100            25.5           14.9           6.8
kNN123    1 & 2 & 3    m = 10             25.4           15.0           7.9



For a given document $\mathcal{D}_i$ in our test set of 745 documents, all 2608 of its
distances to the documents in the training set were computed for each of the
three shingle sizes. The set $\mathcal{K}(i)$ of neighbors to $\mathcal{D}_i$ was formed by taking all
indices $j \in \mathcal{T}$ such that $d_k(i,j)$ is among the $m$ smallest values of that
distance. When multiple shingle sizes were used, the set of neighbors $\mathcal{K}(i)$ was
taken to be the union over the $m$ smallest distances for each of the shingle
sizes used. As the kNN procedure was not very sensitive to the exact choice
of $m$, the values of $m$ we experimented with were 5, 10, 20, 100, 500 and
1000. (The smaller $m$ values, of course, result in faster computation times.)
The optimal bandwidths for use with $\mathcal{D}_i$ were then determined (entirely
from within the training set) using the procedure defined at (4.7) together
with a standard Gaussian kernel at (4.5). For each $m$ and $\mathcal{D}_i$, these
bandwidths were determined by searching over a one-, two- or three-dimensional
grid, depending on the number of shingle lengths used in the procedure; the
optimal bandwidths so resulting were therefore different for each $\mathcal{D}_i$. Finally,
we computed the RMSE (root mean squared error), MAE (mean absolute
error) and MedAE (median absolute error) performance measures for the
resulting date estimates of the 745 documents in our (enlarged) test set.
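
For reference, the three performance measures can be computed as in the following Python sketch. The function name `dating_errors` and the dictionary layout of the result are illustrative assumptions.

```python
import numpy as np

def dating_errors(estimated, true):
    """RMSE, MAE and MedAE (in years) of estimated dates against true dates."""
    err = np.asarray(estimated, dtype=float) - np.asarray(true, dtype=float)
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),   # root mean squared error
        "MAE": float(np.mean(np.abs(err))),          # mean absolute error
        "MedAE": float(np.median(np.abs(err))),      # median absolute error
    }
```

Because the RMSE squares each error, a few badly dated documents raise it much more than they raise the MedAE.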

The results of these computations are summarized in the last seven rows
of Table 2, labeled kNN1 (based on shingle size 1) to kNN123 (based on
using shingle sizes 1, 2 and 3 simultaneously). Among these, the combination
kNN12 is seen to be best, although kNN1 performed similarly in terms of
MAE, while kNN1 and kNN2 performed similarly in terms of MedAE. The
optimal choices for $m$ are also shown in the table. The apparent deterioration
of performance for kNN123 appears related to the fact that $m$ was held equal
for all three shingle sizes. The relatively large values of RMSE occur because
a small number of documents could not be dated at all accurately. By way
of comparison, the mean year for the training documents was approximately
1246, and if this value were used to estimate the dates of the documents in
the test set, the RMSE would be 47, the MAE would be 37, and the MedAE
would be 25.

Fig. 4. Estimated versus true dates for the 745 documents in the test set, using the kNN
method with m = 100 and combining shingle lengths 1 and 2. The solid line is "y = x."

Figure 4, as an example, shows the estimated versus the (presumed) true 
dates for the 745 documents in the test set for the kNN12 procedure based 
on m = 100. This figure evidences some degree of edge bias, with early docu- 
ments having overestimated dates and later ones having somewhat underes- 
timated dates. This bias is due to the one-sided nature of nearest neighbors 
at the edges. 

Our experiments with the maximum prevalence (MP) methods required
all three of the sets $\mathcal{T}$, $\mathcal{V}$ and $\mathcal{A}$. To save computational labor, we
implemented only the locally constant (i.e., Nadaraya-Watson type) version (5.5)
for estimating the shingle probability functions; we used the t-distribution
kernel $K(x) = (1 + x^2/\nu)^{-(\nu+1)/2}$. For each of the shingle sizes 1, 2, 3 and 4,
optimal values of the bandwidth $h$ and degrees of freedom parameter $\nu$ were
determined by optimizing the date estimates for the documents in the
validation set using the training data. Finally, the performance measures were
computed on both the validation and the test set using the parameters that
were determined on the validation set. These results are shown on the rows
labeled M1, M2, M3 and M4 of Table 2. For each of these methods, the
optimized parameter values are shown, and the RMSE, MAE and MedAE
performance measures are given for both the validation and the test set data.
The best performing of these methods was that based on shingle size 2 (i.e.,
method M2), with a median absolute error of 6.0. The shingle size 2 is, in
some sense, the best compromise (for a data set of this size) between having
the deeper information content inherent in longer shingles and having
enough of them. The RMSE and MAE figures are again inflated due to the
presence of a small number of documents that could not be dated accurately.

Fig. 5. Estimated probability function $\hat{\pi}_s(t)$ for the shingle testimonium huic based on
degrees of freedom $\nu = 3$ and bandwidth $h = 12$. The points are the relative frequencies for
this shingle at each date.

Figures 5, 6 and 7 exemplify the main components of the MP procedure.
Figure 5 shows an estimated probability function $\hat{\pi}_s(t)$ for the 2-shingle
testimonium huic ("in witness to which") based on a t-distribution kernel
with bandwidth $h = 12$ and degrees of freedom $\nu = 3$. The points on this
graph are the occurrence proportions for this shingle over time, and the
concentration of points at the bottom of the graph corresponds to years in
which this shingle did not occur. Figure 6 is a plot of the logarithm of
a typical prevalence curve $\hat{\pi}_{\mathcal{D}}(t)$, based on shingle size $k = 2$, using four
different bandwidths, and a document $\mathcal{D}$ in the test set (consisting of 87
words) whose true date is 1299. The MP estimate for this document is 1307;
we note that (as was typically the case) the resulting date estimate is not
unduly sensitive to the exact bandwidth chosen. Figure 7 is a plot of the
estimated versus the true dates for the 326 documents in the test set using
the M2 method. Such edge bias as occurs could likely be reduced by using
the more computationally intensive locally linear smoothing as in equations
(5.6) and (5.7).

Fig. 8. Quantile regressions (QR) for (lower) quantiles q = 0.1 and q = 0.05, using
bandwidths h = 30 (solid lines) and h = 10 (dashed lines). The points are distances from the
document being dated to documents in the training set, plotted against the true dates of the
training documents. The vertical line is at the true date, 1261.



We also attempted to combine the methods M1-M4 using a weighted 
average determined by minimizing MSE (mean squared error) over the val- 
idation set (subject to a constraint that the weights sum to 1). The weights 
for the resulting method, labeled M1234 in Table 2, were found to be 0.14, 
0.64, 0.12 and 0.10. The results for this method were not much better than 
for M2 alone. 
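
A weighted average of this kind can be fitted by least squares under the sum-to-one constraint, as in the Python sketch below. The sketch eliminates the constraint by writing the last weight as one minus the sum of the others and then solves an ordinary least-squares problem. The function name `blend_weights` is hypothetical, and no nonnegativity constraint is imposed, since the source mentions only that the weights sum to 1.

```python
import numpy as np

def blend_weights(preds, truth):
    """Least-squares weights for combining date estimates, constrained to sum to 1.

    preds has one column per method and one row per validation document.
    Substituting w_last = 1 - sum(other weights) turns the constrained problem
    into an unconstrained regression of (truth - last column) on the
    differences (column_j - last column).
    """
    P = np.asarray(preds, dtype=float)
    y = np.asarray(truth, dtype=float)
    last = P[:, -1]
    A = P[:, :-1] - last[:, None]
    free, *_ = np.linalg.lstsq(A, y - last, rcond=None)
    return np.append(free, 1.0 - free.sum())
```

The fitted weights would then be applied unchanged to the test-set estimates, so that the test set plays no part in choosing them.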

Our experiments with the QR method were less successful than for the
kNN and MP methods. While the QR method did generally provide meaningful
estimates, error variation was higher than for kNN or MP, particularly
for documents whose dates were in the upper or lower date ranges where
training data were relatively sparse. Figure 8 provides an illustration of the QR
method using a document $\mathcal{D}$ consisting of 336 words whose true date is 1261
and a training set of 2608 documents $\mathcal{D}_i$. In this plot of the distances
$\mathrm{Dist}(\mathcal{D}, \mathcal{D}_i)$ versus the dates $t(\mathcal{D}_i)$, four quantile regression curves are
drawn. The two solid lines correspond to bandwidth $h = 30$, and the (lower)
quantiles $q = 0.1$ and $q = 0.05$, and lead to date estimates of 1256 and 1252,
respectively; the two dashed lines correspond to bandwidth $h = 10$, and (lower)
quantiles $q = 0.1$ and $q = 0.05$, and give date estimates 1240 and 1241. Note that
this plot is truncated at the far right where the number of training documents
is too small to permit estimation of the quantile curves at all reliably.

In a final series of experiments, we attempted to combine the results of the
kNN and MP methods. For example, when M2 and kNN12 were linearly combined
over the validation set using an RMSE criterion, the optimal weights were found
to be 0.83 and 0.17, and the RMSE over the test set dropped slightly to
13.5 years. The other performance measures, however, were not significantly
changed.

9. Discussion. The problem which motivated this work leads to interest- 
ing technical questions and novel techniques, linking statistical methods to 
work associated with information retrieval. Automated (i.e., computerized) 
calendaring and temporal sequencing of text-based documents are known 
to be difficult problems. In the case of the DEEDS charters, however, two 
features allow for progress to be made. First, we have available a large (and 
increasing) training set of documents whose dates are accurately known. 
And second, the documents in question all have relatively similar formulaic 
structure. 

We remark that the methods we have described can be applied to any 
collection of documents and have potential applications broader than the 
one which motivated this study. For instance, as indicated in FHTG (2005), 
when suitable training data is available kNN-based methods can be adapted 
to detect other types of missing attributes, such as authorship, potentially 
providing a methodology complementary to that of Mosteller and Wallace
(1963). Another potential application is in the detection of forgeries, a problem
related to that of establishing chronology in that a common purpose of
forgery is to alter past intent. It is known that the number of forged English 
medieval charters is not small. One difficulty of this task, however, is the 
fact that multiple and legitimate rewritings of documents have been made 
by scribes who may have modernized or slightly altered the language of the 
documents being transcribed. We also hope that the methods proposed here 
may help determine more precise chronologies in other contexts as well. 

Of the methods investigated, we found that the MP method performed 
best. This appears to be due to its more detailed sensitivity to the behavior 
of individual shingles over time. For example, the MP method was more 
effective in discounting very commonly occurring shingles, since their occur- 
rence probabilities were relatively more constant over time. In our numerical 
work, we also encountered two somewhat surprising results. The first is that 
of the shingle sizes we worked with, shingles of size 1 resulted in estimates 
not unduly far from the best results; shingles of size 2 were better, but not 
by a large margin. The second is that (to within the scale of our exper- 
iments) combining multiple shingle sizes and combining methods did not 
lead to striking improvements. Taken together, these observations appear to 
suggest that, for determining chronology, "single words suffice." 

We are, however, not convinced that this observation will be sustained 
by further work. As the size of the DEEDS data set grows and as our com- 
puting resources increase, it will become possible to carry out estimation 
using larger training sets, using additional methods of estimation and using
more distances. The situation is analogous to that encountered in the collaborative
filtering problem of the Netflix contest where a blend of no fewer
than 800 methods and variations was needed by the winning team. [See, e.g., 
Feuerverger, He and Khatri (2012).] Thus, with more data, we expect further 
progress to be possible via ensemble-type methods and by blending methods 
differently across strata of the data; see, for example, Hastie, Tibshirani and 
Friedman (2009), Chapter 16. Further, with additional data, it will become 
feasible to carry out optimization by referring undated documents to other 
documents of their specific type only (i.e., grant, lease, agreement, etc.), and 
thus to tune the estimation procedures according to document type. While 
further accuracy thus surely seems possible, there must also be some practi- 
cal limit to what can be achieved via purely automated means, particularly 
because any document to be dated is of finite length, and therefore carries 
only a limited amount of "information" regardless of the amount of training 
data available. While accuracies so far attained suffice to make a material 
difference to historians studying that era, the ultimate goal of the DEEDS 
project is to try to attain an accuracy of about ±3 years of error 95% of the 
time. 

We also expect that further progress could be made on the definition of 
distances between documents. One observation we offer is that such distances 
should not be regarded as absolute, but rather as relative to a particular col- 
lection of documents. In this regard, the Multiplicador Total method of R. 
Fiallos seems particularly suggestive. A highly effective distance between 
pairs of documents should take into account all matching patterns between 
them, as well as the lengths, lifetimes, currencies and other relevant fea- 
tures that these matching patterns possess within the context of the whole 
document collection. Related to this is the degree of informativeness of shin- 
gles. For example, Luhn (1958) suggests that shingles which occur neither 
too frequently nor too rarely will tend to be the most informative. As we 
had mentioned, our MP method does tend to discount the very frequently 
occurring shingles, but it does not discount the very rare ones. 

The history of the DEEDS project is not yet fully written and there is 
no doubt other techniques for the calendaring problem will be explored. 
For instance, in ongoing work, we are exploring ways in which collections 
of documents can be correctly sequenced in time (to within time-reversal), 
without regard to any of the dates associated with them. We are also ex- 
ploring ways in which methods such as neural networks and support vector 
machines might be applied to such calendaring problems. 

Remarkably, during the time this work was being carried out, a medieval 
English charter was discovered in a forgotten drawer of a library at Brock 
University (near Niagara Falls), a discovery which resulted in a certain 
amount of local media fanfare. This document records a land grant from
a certain Robert of Clopton to his son William. Attempts by historians using
paleography (analysis of handwriting), content and other means initially
attributed this document to the 14th century, and subsequently to the 13th 
century. More careful work by Robin Sutherland-Harris (a Ph.D. student of 
Medieval Studies at the University of Toronto), based on the Patent Rolls 
(administrative orders of the king) and the eyre records (records of the 
itinerant courts), suggests a date range of 1235-1245, and perhaps, more 
precisely, 1238-1242. These estimates are believed to be reliable; a compari- 
son document — believed to belong to the same time period — was also found 
and was dated 1239. We dated this charter via maximum prevalence (the 
most reliable among the methods we have discussed) using our training set 
of 2608 documents; the date estimate we obtained was 1246.